Abstract

Suppose we are given two probability measures on the set of one-way in- 
finite finite-alphabet sequences and consider the question when one of the 
measures predicts the other, that is, when conditional probabilities converge 
(in a certain sense) when one of the measures is chosen to generate the se- 
quence. This question may be considered a refinement of the problem of 
sequence prediction in its most general formulation: for a given class of prob- 
ability measures, does there exist a measure which predicts all of the measures 
in the class? To address this problem, we find some conditions on local abso- 
lute continuity which are sufficient for prediction and which generalize several 
different notions which are known to be sufficient for prediction. We also for- 
mulate some open questions to outline a direction for finding the conditions 
on classes of measures for which prediction is possible. 
1 Introduction



Let a sequence $x_t$, $t\in\mathbb N$, of letters from some finite alphabet $\mathcal X$ be generated by
some probability measure $\mu$. Having observed the first $n$ letters $x_1,\dots,x_n$ we want
to predict the probability of the next letter being $x$, for each $x\in\mathcal X$. This
task is motivated by numerous applications — from weather forecasting and stock 
market prediction to data compression. 

If the measure $\mu$ is known completely then the best forecasts one can make for
the $(n+1)$st outcome of a sequence $x_1,\dots,x_n$ are the $\mu$-conditional probabilities of $x\in\mathcal X$
given $x_1,\dots,x_n$. On the other hand, it is immediately apparent that if nothing is
known about the distribution generating the sequence then no prediction is pos- 
sible, since for any predictor there is a measure on which it errs (gives inadequate 
probability forecasts) on every step. Thus one has to restrict the attention to some 
class of measures. Laplace was perhaps the first to address the question of sequence 
prediction, his motivation being as follows: Suppose that we know that the Sun has 
risen every day for 5000 years, what is the probability that it will rise tomorrow? 
He suggested to assume that the probability that the Sun rises is the same every 
day and the trials are independent of each other. Thus Laplace considered the task 
of sequence prediction when the true generating measure belongs to the family of 
Bernoulli i.i.d. measures with binary alphabet $\mathcal X=\{0,1\}$. The predicting measure
suggested by Laplace was $\rho_L(x_{n+1}=1|x_1,\dots,x_n)=\frac{k+1}{n+2}$, where $k$ is the number of 1s in
$x_1,\dots,x_n$. The conditional probabilities of Laplace's measure $\rho_L$ converge to the true
conditional probabilities $\mu$-almost surely under any Bernoulli i.i.d. measure $\mu$. This
approach generalizes to the problem of predicting any finite-memory (e.g. Markovian)
measure. Moreover, in [Rya88] a measure $\rho_R$ was constructed for predicting
an arbitrary stationary measure. The conditional probabilities of $\rho_R$ converge to
the true ones on average, where the average is taken over time steps (that is, in the Cesaro
sense), $\mu$-almost surely for any stationary measure $\mu$. However, as was shown in
the same work, there is no measure for which conditional probabilities converge to
the true ones $\mu$-a.s. for every stationary $\mu$. Thus we can see that already for the
problem of predicting outcomes of a stationary measure two criteria of prediction 
arise: prediction in the average (or in Cesaro sense) and prediction on each step, 
and the solution exists only for the former problem. 
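
As a small illustration of Laplace's rule described above, the following sketch simulates a Bernoulli i.i.d. source and prints the forecast $\frac{k+1}{n+2}$ after increasing numbers of observations. The function names, the parameter $p=0.3$ and the fixed seed are assumptions made only for this example; they do not come from the original text.

```python
import random


def laplace_next_one(prefix):
    """Laplace's estimate (k + 1) / (n + 2) that the next binary symbol is 1."""
    n = len(prefix)
    k = sum(prefix)  # number of 1s observed so far
    return (k + 1) / (n + 2)


def sample_bernoulli(p, n, seed=0):
    """Draw n i.i.d. binary symbols with P(1) = p (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


# Hypothetical source parameter, chosen only for illustration.
xs = sample_bernoulli(p=0.3, n=20000)
for n in (10, 100, 1000, 20000):
    print(n, round(laplace_next_one(xs[:n]), 4))
```

With more observations the printed forecasts should settle near the source parameter, which is the almost sure convergence stated in the text for Bernoulli i.i.d. measures.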

But what if the measure generating the sequence is not stationary? A different
assumption one can make is that the measure $\mu$ generating the sequence is
computable. Solomonoff [Sol64, Eq.(13)] suggested a measure $\xi$ for predicting any
computable probability measure. The key observation here is that the class of
all computable probability measures is countable; let us denote it by $\{\nu_i\}_{i\in\mathbb N}$. A
Bayesian predictor $\xi$ for a countable class of measures $\{\nu_i\}_{i\in\mathbb N}$ is constructed as
follows: $\xi(A)=\sum_{i=1}^\infty w_i\nu_i(A)$ for any measurable set $A$, where the weights $w_i$ are
positive and sum to one¹. The best predictor for a measure $\mu$ is the measure $\mu$ itself.

¹It is not necessary for prediction that the weights sum to one. In [Sol78] and [ZL70] $w_i=2^{-K(i)}$
where $K$ stands for the prefix Kolmogorov complexity, and so the weights do not sum to 1. Further,

The Bayesian predictor simply takes the weighted average of the predictors
for all measures in the class; for countable classes this is possible. It was shown by
Solomonoff [Sol78] that $\xi$-conditional probabilities converge to $\mu$-conditional
probabilities almost surely for any computable measure $\mu$. In fact this is a special case of
a more general (though without convergence rate) result of Blackwell and Dubins
[BD62] which states that if a measure $\mu$ is absolutely continuous with respect to a
measure $\rho$ then $\rho$ converges to $\mu$ in total variation $\mu$-almost surely. Convergence
in total variation means prediction in a very strong sense: convergence of conditional
probabilities of arbitrary events (not just the next outcome), or prediction
with an arbitrarily fast growing horizon. Since for $\xi$ we have $\xi(A)\ge w_i\nu_i(A)$ for every
measurable set $A$ and for every $\nu_i$, each $\nu_i$ is absolutely continuous with respect to $\xi$.
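
The Bayesian mixture construction can be sketched on a toy class. The example below is only an illustration under stated assumptions: the class is finite (three Bernoulli measures with hypothetical parameters 1/4, 1/2, 3/4) and the weights are uniform, whereas the text deals with a countable class. It checks the domination of each class member by the mixture on all words of a given length and evaluates a mixture conditional probability.

```python
from fractions import Fraction
from itertools import product


def bernoulli(theta):
    """Return the measure nu(x_{1:n}) of a Bernoulli i.i.d. source with P(1) = theta."""
    def nu(xs):
        k = sum(xs)
        return theta ** k * (1 - theta) ** (len(xs) - k)
    return nu


# Hypothetical finite class and uniform weights (assumptions for this sketch).
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
weights = [Fraction(1, 3)] * 3
members = [bernoulli(t) for t in thetas]


def xi(xs):
    """Bayesian mixture: weighted sum of the class members on the word xs."""
    return sum(w * nu(xs) for w, nu in zip(weights, members))


def xi_cond_one(xs):
    """Mixture conditional probability that the next symbol is 1 given the tuple xs."""
    return xi(xs + (1,)) / xi(xs)


def dominates(n):
    """Check xi(x) >= w_i * nu_i(x) for every binary word x of length n and every i."""
    return all(
        xi(xs) >= w * nu(xs)
        for xs in product((0, 1), repeat=n)
        for w, nu in zip(weights, members)
    )


print(dominates(6))
print(float(xi_cond_one((1, 1, 1, 0) * 50)))
```

The domination check holds by construction, since the mixture is a sum of non-negative terms one of which is $w_i\nu_i$; exact rational arithmetic is used so that the comparison is not affected by rounding.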

Thus the problem of sequence prediction for certain classes of measures (such 
as the class of all stationary measures or the class of all computable measures) was 
often addressed in the literature. Although the classes of measures mentioned are
sufficiently interesting, in applications it is often hard to decide which assumptions
a problem at hand complies with; not to mention such practical issues as that a
predicting measure for all computable measures is necessarily non-computable itself.
Moreover, to be able to generalize the solutions of the sequence prediction problem 
to such problems as active learning, where outcomes of a sequence may depend on 
actions of the predictor, one has to understand better under which conditions the 
problem of sequence prediction is solvable. In particular, in active learning, the 
stationarity assumption does not seem to be applicable (since the predictions are 
non-stationary), although, say, the Markov assumption is often applicable and is 
extensively studied. Thus, we formulate the following general questions which we 
start to address in the present work: 

General motivating questions. For which classes of measures is sequence prediction
possible? Under which conditions does a measure $\rho$ predict a measure $\mu$?

As we have seen, these questions have many facets, and in particular there are 
many criteria of prediction to be considered, such as almost sure convergence of 
conditional probabilities, convergence in average, etc. Extensive as the literature 
on sequence prediction is, these questions in their full generality have not received 
much attention. One line of research which exhibits this kind of generality consists 
in extending the result of Blackwell and Dubins mentioned above, which states that
if $\mu$ is absolutely continuous with respect to $\rho$, then $\rho$ predicts $\mu$ in total variation
distance. In [JKS99] the following question was considered: given a class of measures $\mathcal C$ and a
prior ("meta"-measure) $\lambda$ over this class of measures, do the conditional probabilities
of a Bayesian mixture of the class $\mathcal C$ w.r.t. $\lambda$ converge to the true $\mu$-probabilities
(weakly merge, in the terminology of [JKS99]) for $\lambda$-almost any measure $\mu$ in $\mathcal C$? This
question can be considered solved, since the authors provide necessary and sufficient
conditions on the measure given by the mixture of the class $\mathcal C$ w.r.t. $\lambda$ under which

the $\nu_i$ and $\xi$ are only semi-measures.
prediction is possible. The major difference from the general questions we posed
above is that we do not wish to assume that we have a measure on our class of 
measures. For large (non-parametric) classes of measures it may not be intuitive 
which measure over it is natural; rather, the question is whether a "natural" measure 
which can be used for prediction exists. 

To address the general questions posed, we start with the following observation. 
As it was mentioned, for a Bayesian mixture $\xi$ of a countable class of measures $\nu_i$,
$i\in\mathbb N$, we have $\xi(A)\ge w_i\nu_i(A)$ for any $i$ and any measurable set $A$, where $w_i$ is a
constant. This condition is stronger than the assumption of absolute continuity and
is sufficient for prediction in a very strong sense. Since we are willing to be satisfied
with prediction in a weaker sense (e.g. convergence of conditional probabilities), let
us make a weaker assumption: Say that a measure $\rho$ dominates a measure $\mu$ with
coefficients $c_n>0$ if

$\rho(x_1,\dots,x_n)\ \ge\ c_n\,\mu(x_1,\dots,x_n)$    (1)

for all $x_1,\dots,x_n$.

The first concrete question we pose is, under what conditions on $c_n$ does (1)
imply that $\rho$ predicts $\mu$? Observe that if $\rho(x_1,\dots,x_n)>0$ for any $x_1,\dots,x_n$ then any
measure $\mu$ is locally absolutely continuous with respect to $\rho$ (that is, the measure
$\mu$ restricted to the first $n$ trials $\mu|_{\mathcal X^n}$ is absolutely continuous w.r.t. $\rho|_{\mathcal X^n}$ for each
$n$), and moreover, for any measure $\mu$ some constants $c_n$ can be found that satisfy
(1). For example, if $\rho$ is the Bernoulli i.i.d. measure with parameter $\frac12$ and $\mu$ is any
other measure, then (1) is (trivially) satisfied with $c_n=2^{-n}$. Thus we know that
if $c_n\equiv c$ then $\rho$ predicts $\mu$ in a very strong sense, whereas exponentially decreasing
$c_n$ are not enough for prediction. Perhaps somewhat surprisingly, we will show
that dominance with any subexponentially decreasing coefficients is sufficient for
prediction, in a weak sense of convergence of expected averages. Dominance with
any polynomially decreasing coefficients, and also with coefficients decreasing (for
example) as $c_n=\exp(-\sqrt n/\log n)$, is sufficient for (almost sure) prediction on average
(i.e. in the Cesaro sense). However, for prediction on every step we have a negative result:
for any dominance coefficients that go to zero there exists a pair of measures $\rho$ and
$\mu$ which satisfy (1) but $\rho$ does not predict $\mu$ in the sense of almost sure convergence
of probabilities. Thus the situation is similar to that for predicting any stationary
measure: prediction is possible in the average but not on every step.

Note also that for Laplace's measure $\rho_L$ it can be shown that $\rho_L$ dominates any
i.i.d. measure $\mu$ with linearly decreasing coefficients $c_n=\frac{1}{n+1}$; a generalization of
$\rho_L$ for predicting all measures with memory $k$ (for a given $k$) dominates them with
polynomially decreasing coefficients. Thus dominance with decreasing coefficients 
generalizes (in a sense) predicting countable classes of measures (where we have 
dominance with a constant), absolute continuity (via local absolute continuity), and 
predicting i.i.d. and finite-memory measures. 

Another way to look for generalizations is as follows. The Bayes mixture $\xi$, being
a sum of countably many measures (predictors), possesses some of their predicting
properties. In general, which predictive properties are preserved under summation?
In particular, if we have two predictors $\rho_1$ and $\rho_2$ for two classes of measures, we are
interested in the question whether $\frac12(\rho_1+\rho_2)$ is a predictor for the union of the two
classes. An answer to this question would improve our understanding of how far 
a class of measures for which a predicting measure exists can be extended without 
losing this property. 

Thus, the second question we consider is the following: suppose that a measure $\rho$
predicts $\mu$ (in some weak sense), and let $\chi$ be some other measure (e.g. a predictor for
a different class of measures). Does the measure $\rho'=\frac12(\rho+\chi)$ still predict $\mu$? That
is, we ask to which prediction quality criteria the idea of taking a Bayesian
sum generalizes. Absolute continuity is preserved under summation along with its
(strong) prediction ability. It was mentioned in [RA06] that prediction in the (weak)
sense of convergence of expected averages of conditional probabilities is preserved 
under summation. Here we find that several stronger notions of prediction are not 
preserved under summation. 

Thus we address the following two questions. Is dominance with decreasing
coefficients sufficient for prediction in some sense, under some conditions on the
coefficients? And, if a measure $\rho$ predicts a measure $\mu$ in some sense, does the measure
$\frac12(\rho+\chi)$ also predict $\mu$ in the same sense, where $\chi$ is an arbitrary measure?
Considering different criteria of prediction (a.s. convergence of conditional probabilities,
a.s. convergence of averages, etc.) in the above two questions we obtain not two but 
many different questions, some of which we answer in the positive and some in the 
negative, yet some are left open. 

Contents. The paper is organized as follows. Section 2 introduces necessary no- 
tation and measures of divergence of probability measures. Section 3 addresses the 
question of whether dominance with decreasing coefficients is sufficient for predic- 
tion, while in Section 4 we consider the question of summing a predictor with an 
arbitrary measure. Both sections 3 and 4 also propose some open questions and 
directions for future research. In Section 5 we discuss some interesting special cases 
of the questions considered, and also some related problems. 

2 Notation and Definitions 

We consider processes on the set of one-way infinite sequences $\mathcal X^\infty$ where $\mathcal X$ is a
finite set (alphabet). In the examples we will often assume $\mathcal X=\{0,1\}$. The notation
$x_{1:n}$ is used for $x_1,\dots,x_n$ and $x_{<n}$ for $x_1,\dots,x_{n-1}$, $x_t\in\mathcal X$. The symbol $\mu$ is reserved for
the "true" measure generating examples. We use $\mathbf E_\nu$ for expectation with respect to
a measure $\nu$ and simply $\mathbf E$ for $\mathbf E_\mu$ (expectation with respect to the "true" measure
generating examples).

For two measures $\mu$ and $\rho$ define the following measures of divergence.

($d$) Kullback-Leibler (KL) divergence
    $d_n(\mu,\rho|x_{<n})=\sum_{x\in\mathcal X}\mu(x_n=x|x_{<n})\log\frac{\mu(x_n=x|x_{<n})}{\rho(x_n=x|x_{<n})}$,

($\bar d$) average KL divergence
    $\bar d_n(\mu,\rho|x_{1:n})=\frac1n\sum_{t=1}^n d_t(\mu,\rho|x_{<t})$,

($a$) absolute distance
    $a_n(\mu,\rho|x_{<n})=\sum_{x\in\mathcal X}|\mu(x_n=x|x_{<n})-\rho(x_n=x|x_{<n})|$,

($\bar a$) average absolute distance
    $\bar a_n(\mu,\rho|x_{1:n})=\frac1n\sum_{t=1}^n a_t(\mu,\rho|x_{<t})$.



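Definition 1 (Convergence concepts) We say that $\rho$ predicts $\mu$

($d$) in KL divergence if $d_n(\mu,\rho|x_{<n})\to0$ $\mu$-a.s.,

($\bar d$) in average KL divergence if $\bar d_n(\mu,\rho|x_{1:n})\to0$ $\mu$-a.s.,

($\mathbf E\bar d$) in expected average KL divergence if $\mathbf E_\mu\bar d_n(\mu,\rho|x_{1:n})\to0$,

($a$) in absolute distance if $a_n(\mu,\rho|x_{<n})\to0$ $\mu$-a.s.,

($\bar a$) in average absolute distance if $\bar a_n(\mu,\rho|x_{1:n})\to0$ $\mu$-a.s.,

($\mathbf E\bar a$) in expected average absolute distance if $\mathbf E_\mu\bar a_n(\mu,\rho|x_{1:n})\to0$.

The argument $x_{1:n}$ will be often left implicit in our notation. A measure $\rho$
converges to $\mu$ in total variation ($tv$) if $\sup_{A\in\sigma(\cup_{t=n}^\infty\mathcal X^t)}|\mu(A|x_{<n})-\rho(A|x_{<n})|\to0$
$\mu$-almost surely. Some other measures of prediction ability are considered in Section 5.
The following implications hold (and are complete and strict):

$$\begin{array}{ccccccc}
 & & d & \Rightarrow & \bar d & \Rightarrow & \mathbf E\bar d\\
 & & \Downarrow & & \Downarrow & & \Downarrow\\
tv & \Rightarrow & a & \Rightarrow & \bar a & \Rightarrow & \mathbf E\bar a
\end{array}$$

to be understood as e.g.: if $d_n\to0$ a.s. then $a_n\to0$ a.s., or, if $\mathbf E\bar d_n\to0$ then $\mathbf E\bar a_n\to0$.
The horizontal implications $\Rightarrow$ follow immediately from the definitions, and the $\Downarrow$
follow from the following Lemma: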

Lemma 2 ($a^2\le2d$) For all measures $\rho$ and $\mu$ and sequences $x_{1:\infty}$ we have: $a_t^2\le2d_t$
and $\bar a_n^2\le2\bar d_n$ and $(\mathbf E\bar a_n)^2\le2\mathbf E\bar d_n$.

Proof. Pinsker's inequality [Hut05, Lem.3.11a] implies $a_t^2\le2d_t$. Using this and
Jensen's inequality for the average we get

$2\bar d_n=\frac1n\sum_{t=1}^n2d_t\ \ge\ \frac1n\sum_{t=1}^na_t^2\ \ge\ \Big(\frac1n\sum_{t=1}^na_t\Big)^2=\bar a_n^2.$

Using this and Jensen's inequality for the expectation $\mathbf E$ we get $2\mathbf E\bar d_n\ge\mathbf E\bar a_n^2\ge(\mathbf E\bar a_n)^2$.
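
The one-step inequality $a_t^2\le2d_t$ of Lemma 2 can be probed numerically. The sketch below is a sanity check on random pairs of distributions, not a proof; the alphabet size 3, the sampling scheme and the use of the natural logarithm are assumptions of this example.

```python
import math
import random


def kl_divergence(mu, rho):
    """d = sum_x mu(x) log(mu(x) / rho(x)), natural logarithm; terms with mu(x) = 0 are 0."""
    return sum(m * math.log(m / r) for m, r in zip(mu, rho) if m > 0)


def absolute_distance(mu, rho):
    """a = sum_x |mu(x) - rho(x)|."""
    return sum(abs(m - r) for m, r in zip(mu, rho))


def random_distribution(rng, size):
    """A random strictly positive probability vector of the given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
pairs = [
    (random_distribution(rng, 3), random_distribution(rng, 3))
    for _ in range(10000)
]
# Smallest observed value of 2d - a^2; Lemma 2 says it is never negative.
worst_slack = min(
    2 * kl_divergence(mu, rho) - absolute_distance(mu, rho) ** 2
    for mu, rho in pairs
)
print(worst_slack)
```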
3 Dominance with Decreasing Coefficients



First we consider the question whether property (1) is sufficient for prediction. 

Definition 3 (Dominance) We say that a measure $\rho$ dominates a measure $\mu$ with
coefficients $c_n>0$ iff

$\rho(x_{1:n})\ \ge\ c_n\,\mu(x_{1:n})$.

Suppose that $\rho$ dominates $\mu$ with decreasing coefficients $c_n$. Does $\rho$ predict $\mu$ in
(expected, expected average) KL divergence (absolute distance)? First let us give
an example.

Proposition 4 (Dominance of Laplace's measure) Let $\rho_L$ be the Laplace measure,
given by $\rho_L(x_{n+1}=a|x_{1:n})=\frac{k+1}{n+|\mathcal X|}$ for any $a\in\mathcal X$ and any $x_{1:n}\in\mathcal X^n$, where $k$ is
the number of occurrences of $a$ in $x_{1:n}$. Then

for any measure $\mu$ which generates independently and identically distributed symbols.
This bound is sharp. 

Proof. We will only give the proof for $\mathcal X=\{0,1\}$; the general case is analogous. To
calculate $\rho_L(x_{1:n})$ observe that it only depends on the number of 0s and 1s in $x_{1:n}$
and not on their order. Thus we compute $\rho_L(x_{1:n})=\frac{k!(n-k)!}{(n+1)!}$ where $k$ is the number
of 1s. For any measure $\mu$ such that $\mu(x_n=1)=p$ for some $p\in[0,1]$ independently for
all $n$, and for Laplace measure we have

$\frac{\mu(x_{1:n})}{\rho_L(x_{1:n})}=\frac{(n+1)!}{k!(n-k)!}\,p^k(1-p)^{n-k}=(n+1)\binom{n}{k}p^k(1-p)^{n-k}\ \le\ n+1$

for any $n$-letter word $x_1,\dots,x_n$ where $k$ is the number of 1s in it. The bound is
attained when $p=1$, so that $k=n$, $\mu(x_{1:n})=1$, and $\rho_L(x_{1:n})=\frac{1}{n+1}$. ■
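
For the binary case treated in the proof, the dominance coefficient can be verified exhaustively for small $n$. The following sketch uses exact rational arithmetic; the grid of source parameters and the helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def laplace_measure(xs):
    """rho_L(x_{1:n}) as a product of Laplace conditionals (k + 1) / (n + 2)."""
    prob = Fraction(1)
    ones = 0
    for n, x in enumerate(xs):
        p_one = Fraction(ones + 1, n + 2)
        prob *= p_one if x == 1 else 1 - p_one
        ones += x
    return prob


def bernoulli_measure(xs, p):
    """mu(x_{1:n}) for an i.i.d. source with P(1) = p."""
    k = sum(xs)
    return p ** k * (1 - p) ** (len(xs) - k)


def min_ratio(n, grid):
    """Smallest value of rho_L(x) / mu(x) over binary words x of length n and p in grid."""
    return min(
        laplace_measure(xs) / bernoulli_measure(xs, p)
        for xs in product((0, 1), repeat=n)
        for p in grid
        if bernoulli_measure(xs, p) > 0
    )


# Hypothetical parameter grid; it contains p = 1, where the bound is attained.
grid = [Fraction(i, 4) for i in range(5)]
for n in range(1, 7):
    print(n, min_ratio(n, grid))
```

Because the grid contains $p=1$, the minimum is reached on the all-ones word, in agreement with the sharpness remark at the end of the proof.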

Thus for Laplace's measure $\rho_L$ and binary $\mathcal X$ we have $c_n=O(\frac1n)$. As it was
mentioned in the introduction, in general, exponentially decreasing coefficients $c_n$
are not sufficient for prediction, since (1) is satisfied with $\rho$ being a Bernoulli i.i.d.
measure and $\mu$ any other measure. On the other hand, the following theorem
shows that in a weak sense of convergence in expected average KL divergence (or
absolute distance) the property (1) with subexponentially decreasing $c_n$ is sufficient.
We also recall that if $c_n$ are bounded from below then prediction in the strong
sense of total variation is possible.
Theorem 5 ($\mathbf E\bar d\to0$ and $\mathbf E\bar a\to0$) Let $\mu$ and $\rho$ be two measures on $\mathcal X^\infty$ and suppose
that $\rho(x_{1:n})\ge c_n\mu(x_{1:n})$ for any $x_{1:n}$, where $c_n$ are positive constants satisfying
$\frac1n\log c_n^{-1}\to0$. Then $\rho$ predicts $\mu$ in expected average KL divergence $\mathbf E_\mu\bar d_n(\mu,\rho)\to0$
and in expected average absolute distance $\mathbf E_\mu\bar a_n(\mu,\rho)\to0$.

The proof of this theorem is based on the same idea as the proof of convergence
of Solomonoff predictor to any of its summands in [Rya88], see also [Hut05].

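Proof. For convergence in average expected KL divergence we have

$\mathbf E_\mu\bar d_n(\mu,\rho)=\frac1n\mathbf E\sum_{t=1}^n\sum_{x_t\in\mathcal X}\mu(x_t|x_{<t})\log\frac{\mu(x_t|x_{<t})}{\rho(x_t|x_{<t})}=\frac1n\sum_{t=1}^n\mathbf E\,\mathbf E^t\log\frac{\mu(x_t|x_{<t})}{\rho(x_t|x_{<t})}$

$\qquad=\frac1n\mathbf E\log\prod_{t=1}^n\frac{\mu(x_t|x_{<t})}{\rho(x_t|x_{<t})}=\frac1n\mathbf E\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\ \le\ \frac1n\log c_n^{-1}\to0,$

where $\mathbf E^t$ stands for the $\mu$-expectation over $x_t$ conditional on $x_{<t}$.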

The statement for expected average distance follows from this and Lemma 2. ■ 
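
The chain of equalities in the proof can be evaluated exactly for Laplace's measure and a Bernoulli i.i.d. source, where $\rho_L(x_{1:n})=\frac{k!(n-k)!}{(n+1)!}$ depends only on the number $k$ of 1s. The sketch below computes $\frac1n\mathbf E\log\frac{\mu(x_{1:n})}{\rho_L(x_{1:n})}$ and compares it with the bound $\frac1n\log c_n^{-1}=\frac1n\log(n+1)$. The parameter $p=0.3$ is a hypothetical choice, and the function assumes $0<p<1$.

```python
import math


def expected_average_kl(n, p):
    """(1/n) E log(mu(x_{1:n}) / rho_L(x_{1:n})) for a Bernoulli(p) source, 0 < p < 1."""
    total = 0.0
    for k in range(n + 1):
        # log-probability of one particular word with k ones under mu
        log_mu = k * math.log(p) + (n - k) * math.log(1 - p)
        # log of the number of words with k ones
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        # rho_L(x_{1:n}) = k! (n - k)! / (n + 1)! for a word with k ones
        log_rho = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        total += math.exp(log_binom + log_mu) * (log_mu - log_rho)
    return total / n


for n in (10, 100, 1000):
    value = expected_average_kl(n, p=0.3)
    bound = math.log(n + 1) / n
    print(n, value, bound)
```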

With a stronger condition on $c_n$, prediction in average KL divergence can be
established.

Theorem 6 ($\bar d\to0$ and $\bar a\to0$) Let $\mu$ and $\rho$ be two measures on $\mathcal X^\infty$ and suppose
that $\rho(x_{1:n})\ge c_n\mu(x_{1:n})$ for every $x_{1:n}$, where $c_n$ are positive constants satisfying

$\sum_{n=1}^\infty\frac{(\log c_n^{-1})^2}{n^2}<\infty.$    (2)

Then $\rho$ predicts $\mu$ in average KL divergence $\bar d_n(\mu,\rho)\to0$ $\mu$-a.s. and in average
absolute distance $\bar a_n(\mu,\rho)\to0$ $\mu$-a.s.

In particular, the condition (2) on the coefficients is satisfied for polynomially
decreasing coefficients, or for $c_n=\exp(-\sqrt n/\log n)$.

Proof. Again the second statement (about absolute distance) follows from the first
one and Lemma 2, so that we only have to prove the statement about KL divergence.

Introduce the symbol $\mathbf E^n$ for $\mu$-expectation over $x_n$ conditional on $x_{<n}$. Consider
random variables $l_n=\log\frac{\mu(x_n|x_{<n})}{\rho(x_n|x_{<n})}$ and $\bar l_n=\frac1n\sum_{t=1}^nl_t$. Observe that $d_n=\mathbf E^nl_n$, so that
the random variables $m_n=l_n-d_n$ form a martingale difference sequence (that is,
$\mathbf E^nm_n=0$). Let also $\bar m_n=\frac1n\sum_{t=1}^nm_t$. We will show that $\bar m_n\to0$ $\mu$-a.s. and $\bar l_n\to0$
$\mu$-a.s., which implies $\bar d_n\to0$ $\mu$-a.s.

Note that

$\bar l_n=\frac1n\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\ \le\ \frac{\log c_n^{-1}}{n}\to0.$

Thus to show that $\bar l_n$ goes to 0 we need to bound it from below. It is easy to
see that $n\bar l_n$ is ($\mu$-a.s.) bounded from below by a constant, since $\frac{\rho(x_{1:n})}{\mu(x_{1:n})}$ is a
$\mu$-martingale whose expectation is 1, and so it converges to a finite limit $\mu$-a.s. by
Doob's submartingale convergence theorem, see e.g. [Shi96, p. 508].
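Next we will show that $\bar m_n\to0$ $\mu$-a.s. We have

$m_n=\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}-\log\frac{\mu(x_{<n})}{\rho(x_{<n})}-\mathbf E^n\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}+\mathbf E^n\log\frac{\mu(x_{<n})}{\rho(x_{<n})}$

$\quad\ =\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}-\mathbf E^n\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}.$

Let $f(n)$ be some function monotonically increasing to infinity such that

$\sum_{n=1}^\infty\frac{(\log c_n^{-1}+f(n))^2}{n^2}<\infty$    (3)

(e.g. choose $f(n)=\log n$ and exploit $(\log c_n^{-1}+f(n))^2\le2(\log c_n^{-1})^2+2f(n)^2$ and (2)).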
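For a sequence of random variables $\lambda_n$ define

$(\lambda_n)^{+(f)}=\begin{cases}\lambda_n&\text{if }\lambda_n\ge-f(n)\\0&\text{otherwise}\end{cases}$

and $\lambda_n^{-(f)}=\lambda_n-\lambda_n^{+(f)}$. Introduce also

$m_n^+=\Big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\Big)^{+(f)}-\mathbf E^n\Big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\Big)^{+(f)},$

$m_n^-=m_n-m_n^+$ and the averages $\bar m_n^+$ and $\bar m_n^-$. Observe that $m_n^+$ is a martingale
difference sequence. Hence to establish the convergence $\bar m_n^+\to0$ we can use the
martingale strong law of large numbers [Shi96, p. 501], which states that, for a martingale
difference sequence $\lambda_n$, if $\mathbf E(n\bar\lambda_n)^2<\infty$ and $\sum_{n=1}^\infty\mathbf E\lambda_n^2/n^2<\infty$ then $\bar\lambda_n\to0$
a.s. Indeed, for $m_n^+$ the first condition is trivially satisfied (since the expectation in
question is a finite sum of finite numbers), and the second follows from the fact that
$|m_n^+|\le\log c_n^{-1}+f(n)$ and (3).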
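Furthermore, we have

$m_n^-=\Big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\Big)^{-(f)}-\mathbf E^n\Big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\Big)^{-(f)}.$

As it was mentioned before, $\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ converges $\mu$-a.s. either to (positive) infinity or
to a finite number. Hence $\big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\big)^{-(f)}$ is non-zero only a finite number of times,
and so its average goes to zero. To see that $\mathbf E^n\big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\big)^{-(f)}\to0$ we write

$\mathbf E^n\Big(\log\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\Big)^{-(f)}=\sum_{x_n\in\mathcal X}\mu(x_n|x_{<n})\Big(\log\frac{\mu(x_{<n})}{\rho(x_{<n})}+\log\frac{\mu(x_n|x_{<n})}{\rho(x_n|x_{<n})}\Big)^{-(f)}$

$\qquad\ge\sum_{x_n\in\mathcal X}\mu(x_n|x_{<n})\Big(\log\frac{\mu(x_{<n})}{\rho(x_{<n})}+\log\mu(x_n|x_{<n})\Big)^{-(f)}$

and note that the first term in brackets is bounded from below, and so for the sum in
brackets to be less than $-f(n)$ (which is unbounded) the second term $\log\mu(x_n|x_{<n})$
has to go to $-\infty$, but then the expectation goes to zero since $\lim_{u\to0}u\log u=0$.

Thus we conclude that $\bar m_n^-\to0$ $\mu$-a.s., which together with $\bar m_n^+\to0$ $\mu$-a.s. implies
$\bar m_n\to0$ $\mu$-a.s., which, finally, together with $\bar l_n\to0$ $\mu$-a.s. implies $\bar d_n\to0$ $\mu$-a.s. ■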

However, no form of dominance with decreasing coefficients is sufficient for pre- 
diction in absolute distance or KL divergence, as the following negative result states. 

Proposition 7 ($d\not\to0$ and $a\not\to0$) For each sequence of positive numbers $c_n$ that
goes to 0 there exist measures $\mu$ and $\rho$ and a number $\varepsilon>0$ such that $\rho(x_{1:n})\ge c_n\mu(x_{1:n})$
for all $x_{1:n}$, yet $a_n(\mu,\rho|x_{1:n})>\varepsilon$ and $d_n(\mu,\rho|x_{1:n})>\varepsilon$ infinitely often $\mu$-a.s.

Proof. Let $\mu$ be concentrated on the sequence 11111... (that is $\mu(x_n=1)=1$ for
all $n$), and let $\rho(x_n=1)=1$ for all $n$ except for a subsequence of steps $n=n_k$,
$k\in\mathbb N$, on which $\rho(x_{n_k}=1)=1/2$ independently of each other. It is easy to see that
choosing $n_k$ sparse enough we can make $\rho(1_1\dots1_n)$ decrease arbitrarily slowly; yet
$|\mu(x_{n_k})-\rho(x_{n_k})|=1/2$ for all $k$. ■
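
The construction in this proof can be made concrete with one particular sparse subsequence. In the sketch below the choice $n_k=2^{2^k}$ is an assumption made only for illustration (the proof allows any sufficiently sparse subsequence). Since $\mu$ is concentrated on the all-ones sequence, dominance only has to be checked on words of ones, where $\rho(1_1\dots1_n)$ itself serves as the coefficient.

```python
import math


def special_steps(limit):
    """Steps n_k = 2^(2^k), k = 1, 2, ..., not exceeding limit (an illustrative choice)."""
    steps, k = [], 1
    while 2 ** (2 ** k) <= limit:
        steps.append(2 ** (2 ** k))
        k += 1
    return steps


def rho_next_one(n, steps):
    """rho(x_n = 1): 1/2 on the special steps, 1 elsewhere."""
    return 0.5 if n in steps else 1.0


def rho_all_ones(n, steps):
    """rho(1...1) for n ones: one factor 1/2 per special step up to n."""
    return 0.5 ** sum(1 for s in steps if s <= n)


def absolute_distance(n, steps):
    """a_n between mu (always 1) and rho, summed over both symbols."""
    q = rho_next_one(n, steps)
    return abs(1.0 - q) + abs(0.0 - (1.0 - q))


def kl_divergence(n, steps):
    """d_n = 1 * log(1 / rho(x_n = 1)), since mu puts all mass on the symbol 1."""
    return math.log(1.0 / rho_next_one(n, steps))


steps = special_steps(10 ** 6)
print(steps)
print(rho_all_ones(10 ** 6, steps))
print([absolute_distance(n, steps) for n in steps])
```

The per-symbol gap on the special steps is 1/2, so the absolute distance of Section 2, which sums over both symbols, equals 1 there, while the dominance coefficient shrinks only once per special step.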

Thus for the first question, whether dominance with some coefficients decreasing
to zero is sufficient for prediction, we have the following table of questions and
answers, where, in fact, positive answers for $a_n$ are implied by positive answers for
$d_n$ and vice versa for the negative answers:

$$\begin{array}{cccccc}
\mathbf E\bar d_n & \bar d_n & d_n & \mathbf E\bar a_n & \bar a_n & a_n\\
\hline
+ & + & - & + & + & -
\end{array}$$

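However, if we take into account the conditions on the coefficients, we see some open
problems left, and different answers for $\bar d_n$ and $\bar a_n$ may be obtained. Following is the
table of conditions on dominance coefficients and answers to the questions whether
these conditions are sufficient for prediction (coefficients bounded from below are
included for the sake of completeness).

$$\begin{array}{l|cccccc}
 & \mathbf E\bar d_n & \bar d_n & d_n & \mathbf E\bar a_n & \bar a_n & a_n\\
\hline
\log c_n^{-1}=o(n) & + & ? & - & + & ? & -\\
\sum_{n=1}^\infty\frac{(\log c_n^{-1})^2}{n^2}<\infty & + & + & - & + & + & -\\
c_n\ge c>0 & + & + & + & + & + & +
\end{array}$$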

We know from Proposition 7 that the condition $c_n\ge c>0$ for convergence in $d_n$
cannot be improved; thus the open problem left is to find whether $\log c_n^{-1}=o(n)$ is
sufficient for prediction in $\bar d_n$ or at least in $\bar a_n$.

Another open problem is to find whether any conditions on dominance coeffi- 
cients are necessary for prediction; so far we only have some sufficient conditions. 
On the one hand, the obtained results suggest that some form of dominance with 
decreasing coefficients may be necessary for prediction, at least in the sense of
convergence of averages. On the other hand, the condition (1) is uniform over all
sequences, which probably is not necessary for prediction. As for prediction in the
sense of almost sure convergence, perhaps more subtle behavior of the ratio $\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$
should be analyzed, since dominance with decreasing coefficients is not sufficient for
prediction in this sense.

4 Preservation of the Predictive Ability under 
Summation with an Arbitrary Measure 

Now we turn to the question whether, given a measure $\rho$ that predicts a measure $\mu$ in
some sense, the "contaminated" measure $(1-\varepsilon)\rho+\varepsilon\chi$ for some $0<\varepsilon<1$ also predicts
$\mu$ in the same sense, where $\chi$ is an arbitrary measure. Since most considerations,
in particular the results in this section, are independent of the choice of $\varepsilon$, we set
$\varepsilon=\frac12$ for simplicity. We define

Definition 8 (Contamination) By "$\rho$ contaminated with $\chi$" we mean $\rho':=\frac12(\rho+\chi)$.

Positive results can be obtained for convergence in expected average KL diver- 
gence. The statement of the next proposition in a different form was mentioned in 
[RA06, Hut06]. Since the proof is simple we present it here for the sake of complete- 
ness; it is based on the same ideas as the proof of Theorem 5. 

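Proposition 9 ($\mathbf E\bar d\to0$) Let $\mu$ and $\rho$ be two measures on $\mathcal X^\infty$ and suppose that $\rho$
predicts $\mu$ in expected average KL divergence. Then so does the measure $\rho'=\frac12(\rho+\chi)$
where $\chi$ is any other measure on $\mathcal X^\infty$.

Proof.

$0\le\mathbf E\bar d_n(\mu,\rho')=\frac1n\mathbf E\sum_{t=1}^n\sum_{x_t\in\mathcal X}\mu(x_t|x_{<t})\log\frac{\mu(x_t|x_{<t})}{\rho'(x_t|x_{<t})}=\frac1n\mathbf E\log\frac{\mu(x_{1:n})}{\rho'(x_{1:n})}$

$\qquad=\frac1n\mathbf E\log\Big[\frac{\mu(x_{1:n})}{\rho(x_{1:n})}\frac{\rho(x_{1:n})}{\rho'(x_{1:n})}\Big]=\mathbf E\bar d_n(\mu,\rho)+\frac1n\mathbf E\log\frac{\rho(x_{1:n})}{\rho'(x_{1:n})},$

where the first term tends to 0 by assumption and the second term is bounded
from above by $\frac1n\log2\to0$. Since the sum is bounded from below by 0 we obtain the
statement of the proposition. ■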

Next we consider some negative results. An example of measures $\mu$, $\rho$ and $\chi$
such that $\rho$ predicts $\mu$ in absolute distance (or KL divergence) but $\frac12(\rho+\chi)$ does not,
can be constructed similarly to the example from [KL92] (of a measure $\rho$ which is
a sum of distributions arbitrarily close to $\mu$ yet does not predict it). The idea is to
take a measure $\chi$ that predicts $\mu$ much better than $\rho$ on almost all steps, but on
some steps gives grossly wrong probabilities.
Proposition 10 ($a\not\to0$ and $d\not\to0$) There exist measures $\mu$, $\rho$ and $\chi$ such that $\rho$
predicts $\mu$ in absolute distance (KL divergence) but $\frac12(\rho+\chi)$ does not predict $\mu$ in
absolute distance (KL divergence).



Proof. Let $\mu$ be concentrated on the sequence 11111... (that is $\mu(x_n=1)=1$ for
any $n$), and let $\rho(x_n=1)=\frac{n}{n+1}$ with probabilities independent on different trials.
Clearly, $\rho$ predicts $\mu$ in both absolute distance and KL divergence. Let $\chi(x_n=1)=1$
for all $n$ except on the sequence $n=n_k=2^{2^k}=n_{k-1}^2$, $k\in\mathbb N$, on which $\chi(x_{n_k}=1)=
n_{k-1}/n_k=2^{-2^{k-1}}$. This implies that $\chi(1_{1:n_k})=2/n_k$ and $\chi(1_{1:n_k-1})=\chi(1_{1:n_{k-1}})=
2/n_{k-1}=2/\sqrt{n_k}$. It is now easy to see that $\frac12(\rho+\chi)$ does not predict $\mu$, neither in
absolute distance nor in KL divergence. Indeed for $n=n_k$ for some $k$ we have

$\rho'(x_n=1|1_{<n})=\frac{\rho(1_{1:n})+\chi(1_{1:n})}{\rho(1_{<n})+\chi(1_{<n})}=\frac{1/(n+1)+2/n}{1/n+2/\sqrt n}\to0.$ ■
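
The last display can be evaluated exactly. The sketch below follows the construction of the proof with $n_0=2$ and $n_k=2^{2^k}$; the helper names are assumptions made for this example. The factor $\frac12$ of the mixture cancels in the conditional probability, so it is omitted.

```python
from fractions import Fraction


def special_step(k):
    """n_k = 2^(2^k); n_0 = 2 is used only as the starting value."""
    return 2 ** (2 ** k)


def chi_all_ones(n):
    """chi(1_{1:n}) = 2 / n_j for the largest j with n_j <= n (equal to 1 for n < 4)."""
    j = 0
    while special_step(j + 1) <= n:
        j += 1
    return Fraction(2, special_step(j))


def rho_all_ones(n):
    """rho(1_{1:n}) = prod_{t=1}^n t / (t + 1) = 1 / (n + 1)."""
    return Fraction(1, n + 1)


def mixture_cond_one(n):
    """Conditional probability of x_n = 1 given n - 1 ones under (rho + chi) / 2."""
    numerator = rho_all_ones(n) + chi_all_ones(n)
    denominator = rho_all_ones(n - 1) + chi_all_ones(n - 1)
    return numerator / denominator


for k in range(2, 6):
    n = special_step(k)
    print(n, float(mixture_cond_one(n)), float(mixture_cond_one(n + 1)))
```

On the special steps the mixture assigns a vanishing probability to the symbol that $\mu$ produces with certainty, while on the step right after it is again close to 1; this is the mechanism by which contamination destroys prediction on every step.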



For the (expected) average absolute distance the negative result also holds: 

Proposition 11 ($\bar a\not\to0$) There exist measures $\mu$, $\rho$ and $\chi$ such that $\rho$ predicts $\mu$
in average absolute distance but $\frac12(\rho+\chi)$ does not predict $\mu$ in (expected) average
absolute distance.



Proof. Let $\mu$ be the Bernoulli 1/2 distribution and let $\rho(x_n=1)=1/2$ for all $n$
(independently of each other) except for some sequence $n_k$, $k\in\mathbb N$, on which $\rho(x_{n_k}=1)=0$.
Choose $n_k$ sparse enough for $\rho$ to predict $\mu$ in the average absolute distance. Let
$\chi$ be Bernoulli 1/3. Observe that $\chi$ assigns non-zero probabilities to all finite
sequences, whereas $\mu$-a.s. from some $n$ on $\rho(x_{1:n})=0$. Hence $\frac12(\rho+\chi)(x_{1:n})=\frac12\chi(x_{1:n})$
and so $\frac12(\rho+\chi)$ does not predict $\mu$. ■

Thus for the question of whether predictive ability is preserved when an arbitrary 
measure is added to the predictive measure, we have the following table of answers. 



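$$\begin{array}{cccccc}
\mathbf E\bar d_n & \bar d_n & d_n & \mathbf E\bar a_n & \bar a_n & a_n\\
\hline
+ & ? & - & - & - & -
\end{array}$$
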
As can be seen, there is one open question: whether this property is preserved
under almost sure convergence of the average KL divergence.

It can be inferred from the example in Proposition 10 that contaminating a
predicting measure $\rho$ with a measure $\chi$ spoils $\rho$ if $\chi$ is better than $\rho$ on almost every
step. It thus can be conjectured that adding a measure can only spoil a predictor
on sparse steps, not affecting the average.






Conjecture 12 ($a\to0$ implies $\bar a\to0$) Suppose that a measure $\rho$ predicts a measure
$\mu$ in absolute distance. Then for any measure $\chi$ the measure $\frac12(\rho+\chi)$ predicts
$\mu$ in average absolute distance.

As far as KL divergence is concerned we expect even a stronger conjecture to 
be true, since limited KL divergence does not allow a predicting measure to be (too 
close to) zero on any step. 

Conjecture 13 ($\bar d\to0$) Suppose that a measure $\rho$ predicts a measure $\mu$ in average
KL divergence. Then for any measure $\chi$ the measure $\frac12(\rho+\chi)$ predicts $\mu$ in average
KL divergence.

5 Miscellaneous 

Special cases of dominance with decreasing coefficients. In Section 3 we
have shown that Laplace's measure $\rho_L$ for $\mathcal X=\{0,1\}$ dominates any Bernoulli i.i.d.
measure with linearly decreasing coefficients. It can also be shown that a generalization
of $\rho_L$ to a measure $\rho_L^k$ for predicting any measure with memory $k$, for a given
$k$, dominates any such measure with polynomially decreasing coefficients (namely,
$c_n^{-1}=O(n^{|\mathcal X|^{k+1}})$). The measure $\rho_R$ from [Rya88] for predicting any stationary measure
was constructed as a sum of $\rho_L^k$ with positive weights: $\rho_R(x_{1:n})=\sum_{k=1}^\infty w_k\rho_L^k(x_{1:n})$.
By construction, $\rho_R$ dominates any finite memory measure with polynomially
decreasing coefficients. It is interesting to find whether $\rho_R$ (or any other measure
which predicts all stationary measures) dominates every stationary measure with
some subexponentially decreasing coefficients (or at least dominates non-uniformly).
Clearly, this is a special case of the general open question of whether some form of
dominance with decreasing coefficients is necessary for prediction.

Special questions of summation of a predictor with arbitrary measures. 

Although we know that adding a measure may spoil a predicting measure, it may
be that by carefully selecting which groups of measures to sum we can save all their
predicting properties. One of the interesting cases is the Zvonkin-Levin [ZL70] universal
semi-computable measure²

$\xi(x_{1:n})=\sum_{i=1}^\infty w_i\nu_i(x_{1:n})$    (4)

where $(\nu_i)$, $i\in\mathbb N$, is the class of all lower semi-computable semi-measures, and
$w_i>0$. Since $\xi(A)\ge w_i\nu_i(A)$ for any measurable set $A$ ($\xi$ dominates any $\nu_i$ with
a constant $w_i$), it predicts all $\nu_i$ in the sense of convergence in total variation, in
KL divergence and absolute distance. The question is what else does it predict,

²In fact $\xi$ is not a measure but only a semi-measure, but a semi-measure is sufficient for making
predictions and it will not affect our arguments further.
which other measures? Laplace's measure $\rho_L$ is computable, and hence is present in
the sum (4). We know that $\rho_L$ predicts any Bernoulli i.i.d. measure, so we can ask
whether $\xi$, being a sum of $\rho_L$ and some other measures, still predicts all Bernoulli
measures. The predictor $\rho_R$ from [Rya88] is computable and predicts all stationary
measures in average KL divergence (and average absolute distance). Thus $\xi$ is a sum
of $\rho_R$ and some other measure, and we can ask whether it still predicts all stationary
measures. (In expected average KL divergence this follows from Proposition 9 as
also pointed out in [Hut06].)

Conjecture 14 ($a\to0$ for i.i.d. and $\bar d\to0$ for stationary) For every i.i.d.
measure $\nu$, the measure $\xi$ as defined in (4) predicts $\nu$ in absolute distance. For
every stationary measure $\nu$, the measure $\xi$ predicts $\nu$ in average KL divergence.



Proof idea. For the first question, consider any Bernoulli i.i.d. measure $\nu$. From
Conjecture 12 and from the fact that $\rho_L$ predicts $\nu$ in absolute distance we conclude
that $\xi$ predicts $\nu$ in average absolute distance. Since $\nu$ is a Bernoulli i.i.d. measure,
that is, probabilities on all steps are equal and independent of each other, from
any measure $\xi$ that predicts $\nu$ in average absolute distance we can make a measure
$\xi'$ which predicts $\nu$ in absolute distance as follows: $\xi'(x_n=x|x_{<n}):=\frac1n\sum_{t=1}^n\xi(x_t=x|x_{<t})$.
Moreover, the convergence speed of $\xi'$ to $\nu$ will be the same as that of $\xi$ to
$\nu$. But if $\xi$ is semi-computable then so is $\xi'$, so that $\xi'$ is present in the sum (4).
Since $\xi$ is not a better predictor than $\xi'$, adding $\xi$ to $\xi'$ cannot spoil the latter.

If $\nu$ is a stationary measure, then we know from Proposition 9 and the fact that $\rho_R$ is
computable that $\xi$ predicts $\nu$ in expected average KL divergence (absolute distance).
Conjecture 13 would also imply that $\xi$ predicts $\nu$ in average KL divergence, and
average absolute distance. ■

Other measures of divergence. The last question we discuss is criteria of prediction
other than those introduced in Section 2. Apart from the measures of divergence
of probability measures that we considered we mention also the following:

($s$) squared distance $s_n(\mu,\rho|x_{<n})=\sum_{x\in\mathcal X}\big(\mu(x_n=x|x_{<n})-\rho(x_n=x|x_{<n})\big)^2$,

($h$) Hellinger distance $h_n(\mu,\rho|x_{<n})=\sum_{x\in\mathcal X}\big(\sqrt{\mu(x_n=x|x_{<n})}-\sqrt{\rho(x_n=x|x_{<n})}\big)^2$;

the average squared distance $\bar s_n$ and the average Hellinger distance $\bar h_n$ are introduced
analogously to $\bar a_n$ and $\bar d_n$. It is easy to check that all negative results obtained
hold with respect to $s_n$ and $h_n$ as well. Positive results for $s_n$ and $h_n$ follow from
corresponding positive results for KL divergence $d_n$ and inequalities $s_n(\mu,\rho)\le d_n(\mu,\rho)$
and $h_n(\mu,\rho)\le d_n(\mu,\rho)$, see e.g. [Hut05, Lem.3.11]. Expected absolute convergence
$\mathbf Ea_n\to0$ (also called convergence in the mean) and expected KL convergence $\mathbf Ed_n\to0$
may also be considered.
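
The two inequalities $s_n\le d_n$ and $h_n\le d_n$ quoted above can be probed on random one-step distributions. As before, this is a numerical sanity check under assumptions (alphabet of size 4, natural logarithm, random strictly positive distributions), not a substitute for the cited lemma.

```python
import math
import random


def kl_divergence(mu, rho):
    """d = sum_x mu(x) log(mu(x) / rho(x)), natural logarithm."""
    return sum(m * math.log(m / r) for m, r in zip(mu, rho) if m > 0)


def squared_distance(mu, rho):
    """s = sum_x (mu(x) - rho(x))^2."""
    return sum((m - r) ** 2 for m, r in zip(mu, rho))


def hellinger_distance(mu, rho):
    """h = sum_x (sqrt(mu(x)) - sqrt(rho(x)))^2."""
    return sum((math.sqrt(m) - math.sqrt(r)) ** 2 for m, r in zip(mu, rho))


def random_distribution(rng, size):
    """A random strictly positive probability vector of the given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(2)
violations = 0
for _ in range(10000):
    mu = random_distribution(rng, 4)
    rho = random_distribution(rng, 4)
    d = kl_divergence(mu, rho)
    # Small tolerance guards against floating-point rounding only.
    if squared_distance(mu, rho) > d + 1e-12 or hellinger_distance(mu, rho) > d + 1e-12:
        violations += 1
print(violations)
```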
6 Outlook and Conclusion



In the present work we formulated and started to address the question for which 
classes of measures sequence prediction is possible. Towards this aim we defined 
the notion of dominance with decreasing coefficients (a condition on local absolute 
continuity) and found some forms of it which are sufficient for prediction. We 
have also addressed the question which forms of predictive ability are preserved 
under "contamination" of a predictor by an arbitrary measure. Besides some more 
concrete open problems posed in the corresponding sections, a program for answering 
the general questions formulated can be outlined as follows: We would like to find 
some conditions on dominance with decreasing coefficients which are necessary and 
sufficient for prediction; for those notions of prediction ability for which this is 
not possible, more subtle behavior of the ratio $\frac{\mu(x_{1:n})}{\rho(x_{1:n})}$ should be analyzed to obtain

conditions both necessary and sufficient for prediction. This should give rise to an 
abstract characterization of classes of measures for which a measure satisfying such 
conditions for all measures in the class exists; that is, to a description of classes of 
measures for which prediction is possible. It is expected that such characterization 
will naturally lead to a construction of a predictor as well — perhaps in form of a 
Bayesian integral. The latter conjecture also encourages studying the question of 
"contamination" of a predictor with arbitrary measures. The next step will be to 
extend this approach to the task of active learning [Hut05, RH06]. 